Some quick tips
April 21, 2020
Write comments in your code after # (in Rmd docs # only works within chunks)
my.vec1 <- c("some","word") # this is a comment
my.vec2 <- c("some","other","word") # this is also a comment
save(list=c("my.vec1","my.vec2"),file = "MyCharVecs.RData")
load("MyCharVecs.RData")
Where are these files being saved to and loaded from?
R saves and looks for files in your current working directory. To see what it is, use:
getwd()
## [1] "/cloud/project"
You can also set your session to a working directory
setwd("C:/Users/dtr/theDirectory")
Working dirs in R Markdown docs are set automatically to where the Rmd file is stored
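As a minimal sketch of how the working directory decides where files land, the code below switches to a temporary folder, saves and reloads a vector by bare file name, and then switches back. The folder name `MyProject` and the use of `tempdir()` are assumptions made for illustration; any folder you can write to would do.

```r
# Remember the current working directory so it can be restored later
old.wd <- getwd()

# Use a temporary folder as a stand-in for a project folder (illustrative only)
proj.dir <- file.path(tempdir(), "MyProject")
dir.create(proj.dir, showWarnings = FALSE)
setwd(proj.dir)

# Files given by a bare name are written to the working directory
my.vec1 <- c("some", "word")
save(my.vec1, file = "MyCharVecs.RData")
saved.ok <- file.exists(file.path(proj.dir, "MyCharVecs.RData"))

# Loading by bare name looks in the same place
rm(my.vec1)
load("MyCharVecs.RData")

# Go back to where we started
setwd(old.wd)
```

Storing the old directory first means the session ends up where it started, which matters if later code relies on relative paths.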
Give each project (e.g., a homework) its own folder. Here is my system:
Every class or project has its own folder
Each assignment or task has a folder inside that, which is the working directory for that item.
.Rmd and .R files are named clearly and completely
For example, this presentation is located and named this:
Lectures/Lectures_Week04/DataVisualization02.Rmd
Use whatever system you want, but be consistent!
A statistical graphic is a…
ggplot2 is based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts
It takes care of many of the fiddly details that make plotting a hassle, such as drawing legends or faceting (e.g., legend, mfrow, mfcol, layout in base graphics)
It is a powerful model for the graphical representation of data that simplifies making complex multi-layered visualizations
The ggplot2 package builds plots by layers:
data: ggplot
geometry: geom_point, geom_line, geom_smooth, geom_bar, …
titles and axis labels: ggtitle, labs, xlab, ylab
themes: theme, theme_bw, theme_classic, …
facets: facet_wrap and facet_grid
Layers are separated by a + sign.
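The layer types listed above can be sketched in one small plot. This example uses the `mpg` data set that ships with ggplot2 as a stand-in for your own data; the variable choices and label texts are assumptions for illustration only.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in here for your own data frame
p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +  # data and global aesthetics
  geom_point() +                                    # geometry layer
  xlab("Engine displacement (l)") +                 # axis labels
  ylab("Highway miles per gallon") +
  theme_bw() +                                      # theme
  facet_wrap(~ drv)                                 # facets
```

Typing `p` at the console draws the plot. Each `+` adds one more piece, so lines can be removed or reordered while experimenting.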
ggplot2 defines aesthetics within each layer
Aesthetics control the appearance of the layers (e.g., point/line colors or transparency, with alpha between 0 and 1)
x, y: \(x\) and \(y\) coordinate values to use
color: set color of elements based on some data value
group: describe which points are conceptually grouped together for the plot (often used with lines)
size: set size of points/lines based on some data value
alpha: set transparency based on some data value
Some aesthetics don’t depend on the data and can be specified directly on the layers
Some are: color, size, linetype, shape, fill, and alpha
See the ggplot2 documentation
Use aes() for aesthetics that depend on the data, e.g. geom_point(aes(color = continent))
aes() in the ggplot() layer gives overall aesthetics to use in other layers
aes() can be changed on individual layers
library(ggplot2)
library(gapminder)
data(gapminder)
China <- gapminder[gapminder$country == "China",]
head(China, 4)
## # A tibble: 4 x 6
##   country continent  year lifeExp       pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>     <int>     <dbl>
## 1 China   Asia       1952    44   556263527      400.
## 2 China   Asia       1957    50.5 637408000      576.
## 3 China   Asia       1962    44.5 665770000      488.
## 4 China   Asia       1967    58.4 754550000      613.
ggplot: the base plot
ggplot(data = China,
aes(x = year, y = lifeExp))
ggplot: the geometry
ggplot(data = China,
aes(x = year, y = lifeExp)) +
geom_point()
ggplot: some aesthetics
ggplot(data = China,
aes(x = year, y = lifeExp)) +
geom_point(color = "red", size = 3)
ggplot: axis labels
ggplot(data = China,
aes(x = year, y = lifeExp)) +
geom_point(color = "red", size = 3) +
xlab("Year")
ggplot: axis labels
ggplot(data = China,
aes(x = year, y = lifeExp)) +
geom_point(color = "red", size = 3) +
xlab("Year") + ylab("Life Expectancy")
ggplot: title
ggplot(data = China,
aes(x = year, y = lifeExp)) +
geom_point(color = "red", size = 3) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy in China")
ggplot: theme
ggplot(data = China,
aes(x = year, y = lifeExp)) +
geom_point(color = "red", size = 3) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy in China") +
theme_bw()
ggplot: theme
ggplot(data = China,
aes(x = year, y = lifeExp)) +
geom_point(color = "red", size = 3) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy in China") +
theme_bw(base_size = 14)
ggplot(data = gapminder,
aes(x = year, y = lifeExp)) +
geom_point(color = "red", size = 3) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy over time") +
theme_bw(base_size = 14)
Can’t separate countries.
ggplot(data = gapminder,
aes(x = year, y = lifeExp)) +
geom_line(color = "red") +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy over time") +
theme_bw(base_size = 14)
ggplot can’t tell the countries apart; we need to tell it how!
ggplot: group aesthetic
Tell ggplot to group by country
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
group = country)) +
geom_line(color = "red") +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy over time") +
theme_bw(base_size = 14)
Let’s also make the lines narrower
Are there patterns by continent?
ggplot: color aesthetic
Tell ggplot to color by continent
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
group = country,
colour = continent)) +
geom_line(color = "red", # a fixed color set here overrides the continent mapping
lwd = 0.3) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy over time") +
theme_bw(base_size = 14)
ggplot: color aesthetic
Tell ggplot to color by continent
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
group = country,
color = continent)) +
geom_line(lwd = 0.3) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy over time") +
theme_bw(base_size = 8)
ggplot: facets
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
group = country,
color = continent)) +
geom_line(lwd = 0.3) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy over time") +
theme_bw(base_size = 8) +
facet_wrap(continent~.)
ggplot: legend options
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
group = country,
color = continent)) +
geom_line(lwd = 0.3) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy over time") +
theme_bw(base_size = 8) +
facet_wrap(~ continent) +
theme(legend.position = c(0.8, 0.25))
Remove the legend with legend.position = "none"
ggplot: more on faceting
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
group = country,
color = continent)) +
geom_line(lwd = 0.3) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy over time") +
theme_bw(base_size = 8) +
facet_grid(cols = vars(continent)) +
theme(legend.position = "none")
Facet with facet_grid by cols
ggplot: adding a smooth
ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country)) +
geom_line(lwd=0.1, alpha=0.5) +
geom_line(stat = "smooth", method = "loess",
aes(group = continent, color = continent)) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy over time") +
theme_bw(base_size = 8) +
facet_grid(cols = vars(continent)) +
theme(legend.position = "none")
alpha modifies the transparency
## `geom_smooth()` using formula 'y ~ x'
ggplot: stored ggplots
Assign the ggplot object to a name
my.first.plot <- ggplot(data = gapminder,
aes(x = year, y = lifeExp,
group = country)) +
geom_line(lwd=0.1, alpha=0.5) +
geom_line(stat = "smooth", method = "loess",
aes(group = continent, color = continent)) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life expectancy over time") +
theme_bw(base_size = 8) +
facet_grid(cols = vars(continent)) +
theme(legend.position = "none")
Print the stored plot by typing its name
my.first.plot
## `geom_smooth()` using formula 'y ~ x'
Add to the stored plot with +
my.first.plot + theme(legend.position = "bottom")
## `geom_smooth()` using formula 'y ~ x'
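A stored plot can serve as a base for several variants. The sketch below assumes the `mpg` data set that ships with ggplot2 rather than gapminder, and the object names are made up for illustration.

```r
library(ggplot2)

# Store a base plot once (mpg ships with ggplot2 and is used for illustration)
base.plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# Derive variants without retyping the base plot
plot.bottom <- base.plot + theme(legend.position = "bottom")
plot.smooth <- base.plot + geom_smooth(method = "lm", formula = y ~ x)

# The stored object itself is unchanged by the additions
n.layers.base   <- length(base.plot$layers)
n.layers.smooth <- length(plot.smooth$layers)
```

Adding to a stored plot returns a new object, so `base.plot` still has only its point layer while `plot.smooth` has two layers.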
Explore other relationships in the gapminder data using what you learned today. You could consider other variables in the data set, use an alternative geometry, or facet by other variables. Make just one figure, but use as many of the concepts you learned as possible.
ggplot() initializes a ggplot object
declares the input data and global aesthetics
add layers by using the + operator
geom_[some geom](mapping = NULL, data = NULL, stat, ...)
mapping: list of aesthetic assignments, created with aes(), for the geom object
stat: statistical transformation required for the geom object
NULL: setting indicating to inherit values from ggplot()
...: other args, often aesthetics you want to set independently of the data, e.g. color="green"
Besides x- and y-position, variables can be mapped onto other geom aesthetics
Examples:
geom_point(aes(x=year,
y=lifeExp,
size = pop ))#: point size varies with `pop`
aes(..., color = continent)#: color varies with `continent`
aes(..., fill = continent)#: fill color varies with `continent`
aes(..., linetype = country)#: linetype varies with `country`
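The difference between mapping an aesthetic inside `aes()` and setting it outside can be checked on a small example. This sketch assumes the `mpg` data set that ships with ggplot2; `layer_data()` is used only to inspect the colors ggplot2 ends up assigning.

```r
library(ggplot2)

# Mapped: color depends on the data, so it goes inside aes()
p.mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class))

# Set: a fixed color for every point, given outside aes()
p.set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "green")

# layer_data() returns the values ggplot2 computed for a layer
n.colors.mapped <- length(unique(layer_data(p.mapped)$colour))
n.colors.set    <- length(unique(layer_data(p.set)$colour))
```

The mapped version yields one color per level of `class`, while the set version uses a single color for all points.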
ggplot(data = [dataframe],
mapping=aes(x = [var_x], y = [var_y],
color = [var_for_color],
fill = [var_for_fill],
shape = [var_for_shape])
) +
geom_[some_geom]([geom_arguments],
stat=[stat_transf],
position=[pos_adjust]) +
... + # other geometries
facet_[some_facet]([formula]) +
xlab([an x label]) + ylab([a y label]) +
ggtitle(label = [a title], subtitle=[a subtitle]) +
scale_[some_axis]_[some_scale]([scale_arguments]) +
... # other options
The data set consists of 651 randomly sampled movies produced and released before 2016.
Data come from IMDB and Rotten Tomatoes.
The codebook is available here.
movies = readr::read_csv("data/movies.csv")
movies
## # A tibble: 651 x 32
##    title title_type genre runtime mpaa_rating studio thtr_rel_year
##    <chr> <chr>      <chr>   <dbl> <chr>       <chr>          <dbl>
##  1 Fill… Feature F… Drama      80 R           Indom…          2013
##  2 The … Feature F… Drama     101 PG-13       Warne…          2001
##  3 Wait… Feature F… Come…      84 R           Sony …          1996
##  4 The … Feature F… Drama     139 PG          Colum…          1993
##  5 Male… Feature F… Horr…      90 R           Ancho…          2004
##  6 Old … Documenta… Docu…      78 Unrated     Shcal…          2009
##  7 Lady… Feature F… Drama     142 PG-13       Param…          1986
##  8 Mad … Feature F… Drama      93 R           MGM/U…          1996
##  9 Beau… Documenta… Docu…      88 Unrated     Indep…          2012
## 10 The … Feature F… Drama     119 Unrated     IFC F…          2012
## # … with 641 more rows, and 25 more variables: thtr_rel_month <dbl>,
## #   thtr_rel_day <dbl>, dvd_rel_year <dbl>, dvd_rel_month <dbl>,
## #   dvd_rel_day <dbl>, imdb_rating <dbl>, imdb_num_votes <dbl>,
## #   critics_rating <chr>, critics_score <dbl>, audience_rating <chr>,
## #   audience_score <dbl>, best_pic_nom <chr>, best_pic_win <chr>,
## #   best_actor_win <chr>, best_actress_win <chr>, best_dir_win <chr>,
## #   top200_box <chr>, director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## #   actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>
ggplot(data = movies, aes(x = audience_score)) + geom_histogram(binwidth = 5)
ggplot(data = movies, aes(y = audience_score, x = genre)) + geom_boxplot()
Terrible x-axis labels
ggplot(data = movies, aes(y = audience_score, x = genre)) + geom_boxplot() + theme(axis.text.x=element_text(angle = 45, hjust = 1))
Fixed using the axis.text.x option in theme
ggplot(data = movies, aes(x = runtime)) + geom_density()
Smooth with loess (the default for fewer than 1,000 observations)
ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) + geom_point(alpha = 0.5) + geom_smooth()
ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm")
ggplot(data = movies, aes(x = genre)) + geom_bar() + theme(axis.text.x=element_text(angle = 45, hjust = 1))
ggplot(data = movies, aes(x = runtime, color = audience_rating)) + geom_density()
ggplot(data = movies, aes(x = runtime, fill = audience_rating)) + geom_density()
ggplot(data = movies, aes(x = runtime, fill = audience_rating)) + geom_density(alpha = 0.5)
ggplot(data = movies, aes(x = genre, fill = audience_rating)) + geom_bar() + theme(axis.text.x=element_text(angle = 45, hjust = 1))
ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
geom_bar(position = "fill") + ylab("proportions") +
theme(axis.text.x=element_text(angle = 45, hjust = 1))
ggplot(data = movies, aes(x = genre, fill = audience_rating)) + geom_bar(position = "dodge") + theme(axis.text.x=element_text(angle = 45, hjust = 1))
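What the `position` argument does to bar heights can be inspected directly. The sketch below assumes the `mpg` data set that ships with ggplot2 instead of the movies data, and uses `layer_data()` to read off the bar tops that ggplot2 computes.

```r
library(ggplot2)

# Bars of car class, split by drive type (mpg ships with ggplot2)
p.stack <- ggplot(mpg, aes(x = class, fill = drv)) + geom_bar(position = "stack")
p.fill  <- ggplot(mpg, aes(x = class, fill = drv)) + geom_bar(position = "fill")

# Top of the tallest stacked bar: the largest class count
top.stack <- max(layer_data(p.stack)$ymax)

# Top of every filled bar: always 1, since heights are proportions
top.fill <- max(layer_data(p.fill)$ymax)
```

With `"stack"` the y-axis shows counts; with `"fill"` each bar is rescaled to sum to 1, which is why relabeling the axis as proportions makes sense.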
A scale maps data values to aesthetic (physical) values
Scale specifications have the form
scale_AESTHETIC_SCALENAME()
AESTHETIC: x, y, color, fill, linetype, size or shape
SCALENAME: grey, gradient, hue, manual, continuous, etc
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) + geom_point(alpha = 0.5) + scale_x_log10() + scale_y_sqrt()
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) + geom_point(alpha = 0.5) + scale_x_continuous(trans="identity", breaks=seq(10,100,10), limits=c(1,100)) + scale_y_continuous(trans="identity", breaks=c(1,20,50,100), limits=c(1,100))
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) + geom_point() + scale_color_viridis_d()
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) + geom_point() + scale_color_brewer(palette = "Accent")
ggplot(data = movies, aes(x = runtime, fill = audience_rating)) + geom_density(alpha = 0.5) + scale_x_log10()
ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
geom_density() +
scale_x_log10() +
scale_fill_manual(values=c("#4B9CD3","#001A57"))
geom_bar(mapping = NULL,
data = NULL,
stat = "count",
position = "stack",...)
stat: statistically transforms the input data ("count", the default, counts the cases at each x; "bin" bins a continuous x and counts)
position: "dodge" for side-by-side bars or "stack" for stacked bars
ggplot(data = movies, aes(x = audience_score)) + geom_bar(stat="bin")
ggplot(data = movies, aes(x = audience_score)) + geom_bar(stat="bin", binwidth = 20)
binwidth specifies the width of each bin (use bins to set the number of bins)
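The two ways of controlling bins can be compared on a toy example. The scores below are made-up data, not the movies data set, and `layer_data()` is used only to look at the bins ggplot2 computes.

```r
library(ggplot2)

# Illustrative scores from 1 to 100 (made-up data, not the movies data set)
scores <- data.frame(score = 1:100)

# binwidth: every bin is 20 units wide
by.width <- layer_data(ggplot(scores, aes(x = score)) + geom_histogram(binwidth = 20))

# bins: the range is split into exactly 5 bins
by.count <- layer_data(ggplot(scores, aes(x = score)) + geom_histogram(bins = 5))

bin.widths <- by.width$xmax - by.width$xmin
n.bins     <- nrow(by.count)
```

Fixing the width lets the number of bins follow from the data range, while fixing the number of bins lets the width follow from it.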
Do the stat = "bin" transformation by hand:
Cut audience_score into the groups (0,20], (20,40], (40,60], (60,80], (80,100] with
aud_scorecut <- cut(movies$audience_score, breaks=seq(0,100,20))
Count the movies in each group with
count.df <- as.data.frame(table(aud_scorecut))
Question: what stat argument do you need?
aud_scorecut <- cut(movies$audience_score, breaks=seq(0,100,20))
count.df <- as.data.frame(table(aud_scorecut))
count.df
##   aud_scorecut Freq
## 1       (0,20]   13
## 2      (20,40]  104
## 3      (40,60]  164
## 4      (60,80]  217
## 5     (80,100]  153
No good, why?
ggplot(count.df, aes(x=aud_scorecut, y=Freq)) + geom_bar(stat="bin")
## Error: stat_bin() can only have an x or y aesthetic.
Ok, so how about this?
ggplot(count.df, aes(x=aud_scorecut)) + geom_bar(stat="bin")
## Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
ggplot(count.df, aes(x=aud_scorecut, y=Freq)) + geom_bar(stat="identity")
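The whole by-hand workflow can be sketched on a small vector so the counts are easy to verify. The scores and object names below are made up for illustration and stand in for `movies$audience_score`.

```r
library(ggplot2)

# Made-up scores standing in for movies$audience_score
scores <- c(5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 100)

# Step 1: bin by hand; step 2: count the scores in each bin
score.cut <- cut(scores, breaks = seq(0, 100, 20))
count.df  <- as.data.frame(table(score.cut))

# The counts already exist, so the bars should use them as they are
p.identity <- ggplot(count.df, aes(x = score.cut, y = Freq)) +
  geom_bar(stat = "identity")

# geom_col() also leaves the data as is and draws the same bars
p.col <- ggplot(count.df, aes(x = score.cut, y = Freq)) + geom_col()
```

Because the data are already summarized, any stat that counts again would be applied to five rows rather than to the original scores, which is why `stat = "identity"` is the appropriate choice here.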
Shiny is a web application framework for R with which you can easily turn your analyses into interactive web applications
No HTML, CSS, or JavaScript knowledge required
Materials above are adapted from the following sources: